Efficient supervised and semi-supervised approaches for affiliations disambiguation
Internal identifier: 001592 (Main/Exploration); previous: 001591; next: 001593
Authors: Pascal Cuxac [France]; Jean-Charles Lamirel [France]; Valerie Bonvallot [France]
Source:
- Scientometrics [ 0138-9130 ] ; 2013-10-01.
French descriptors
- Pascal (Inist)
- Wicri :
- topic : Classification, Recherche scientifique.
- mix :
English descriptors
- KwdEn :
Abstract
Abstract: The disambiguation of named entities is a challenge in many fields, such as scientometrics, social networks, record linkage, citation analysis, and the semantic web. Name ambiguities can arise from misspellings, typographical or OCR mistakes, abbreviations, and omissions. Searching for the names of persons or organizations is therefore difficult whenever a single name can appear in many different forms. This paper proposes two approaches to disambiguating the affiliations of authors of scientific papers in bibliographic databases. The first assumes that a training dataset is available and uses a Naive Bayes model. The second assumes that no learning resource is available and uses a semi-supervised approach that mixes soft clustering and Bayesian learning. The results are encouraging, and the approach is already partially applied in a scientific survey department. However, our experiments also highlight a limitation: the approach cannot efficiently process highly unbalanced data. Alternative solutions are possible for future developments, particularly the use of a recent clustering algorithm relying on feature maximization.
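The supervised idea in the abstract (a Naive Bayes model trained on labelled affiliation strings) can be illustrated with a minimal sketch. This is not the authors' implementation: the record gives no details of their features or model variant, so the tokenizer, the multinomial model with add-one smoothing, the class name `NaiveBayes`, and the four training strings and their labels are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase, then split on anything that is not an ASCII letter or digit.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.token_counts[label].update(tokenize(text))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)  # log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training set: affiliation variants mapped to canonical labels.
train_texts = [
    "LORIA, Univ. Lorraine, Vandoeuvre les Nancy",
    "Lab. lorrain de recherche en informatique, Nancy",
    "INIST-CNRS, Vandoeuvre les Nancy",
    "Inst. de l'information scientifique et technique, CNRS",
]
train_labels = ["LORIA", "LORIA", "INIST", "INIST"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("LORIA Synalp, Nancy"))
print(model.predict("CNRS INIST Vandoeuvre"))
```

With so few examples the classifier only separates the two labels on a handful of distinctive tokens. The abstract's caveat about highly unbalanced data applies here too, since the class prior and the smoothed counts both depend on how many examples each label has.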
Url:
- https://api.istex.fr/ark:/67375/VQC-7LBM11HJ-9/fulltext.pdf
- https://hal.archives-ouvertes.fr/hal-00960435
DOI: 10.1007/s11192-013-1025-5
Affiliations:
- France
- Grand Est, Lorraine (région)
- Nancy, Vandœuvre-lès-Nancy
- Centre national de la recherche scientifique, Laboratoire lorrain de recherche en informatique et ses applications, Synalp (Loria), Université de Lorraine
Links to previous steps (curation, corpus...)
- to stream Istex, to step Corpus: 001627
- to stream Istex, to step Curation: 001608
- to stream Istex, to step Checkpoint: 000218
- to stream Hal, to step Corpus: 001F05
- to stream Hal, to step Curation: 001F05
- to stream Hal, to step Checkpoint: 000E98
- to stream Main, to step Merge: 001604
- to stream PascalFrancis, to step Corpus: 000055
- to stream PascalFrancis, to step Corpus: 000095
- to stream PascalFrancis, to step Curation: 000952
- to stream PascalFrancis, to step Checkpoint: 000067
- to stream Main, to step Merge: 001738
- to stream Main, to step Curation: 001592
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Efficient supervised and semi-supervised approaches for affiliations disambiguation</title>
<author><name sortKey="Cuxac, Pascal" sort="Cuxac, Pascal" uniqKey="Cuxac P" first="Pascal" last="Cuxac">Pascal Cuxac</name>
</author>
<author><name sortKey="Lamirel, Jean Charles" sort="Lamirel, Jean Charles" uniqKey="Lamirel J" first="Jean-Charles" last="Lamirel">Jean-Charles Lamirel</name>
<affiliation><country>France</country>
<placeName><settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="team" n="7">Synalp (Loria)</orgName>
<orgName type="lab">Laboratoire lorrain de recherche en informatique et ses applications</orgName>
<orgName type="university">Université de Lorraine</orgName>
<orgName type="EPST">Centre national de la recherche scientifique</orgName>
</affiliation>
</author>
<author><name sortKey="Bonvallot, Valerie" sort="Bonvallot, Valerie" uniqKey="Bonvallot V" first="Valerie" last="Bonvallot">Valerie Bonvallot</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:5FD71571C0E911CDA4E18F3CBDA209703F76C106</idno>
<date when="2013" year="2013">2013</date>
<idno type="doi">10.1007/s11192-013-1025-5</idno>
<idno type="url">https://api.istex.fr/ark:/67375/VQC-7LBM11HJ-9/fulltext.pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001627</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">001627</idno>
<idno type="wicri:Area/Istex/Curation">001608</idno>
<idno type="wicri:Area/Istex/Checkpoint">000218</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Checkpoint">000218</idno>
<idno type="wicri:doubleKey">0138-9130:2013:Cuxac P:efficient:supervised:and</idno>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-00960435</idno>
<idno type="url">https://hal.archives-ouvertes.fr/hal-00960435</idno>
<idno type="wicri:Area/Hal/Corpus">001F05</idno>
<idno type="wicri:Area/Hal/Curation">001F05</idno>
<idno type="wicri:Area/Hal/Checkpoint">000E98</idno>
<idno type="wicri:explorRef" wicri:stream="Hal" wicri:step="Checkpoint">000E98</idno>
<idno type="wicri:doubleKey">0138-9130:2013:Cuxac P:efficient:supervised:and</idno>
<idno type="wicri:Area/Main/Merge">001604</idno>
<idno type="wicri:source">INIST</idno>
<idno type="RBID">Pascal:13-0331130</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000055</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000095</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000952</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000067</idno>
<idno type="wicri:explorRef" wicri:stream="PascalFrancis" wicri:step="Checkpoint">000067</idno>
<idno type="wicri:doubleKey">0138-9130:2013:Cuxac P:efficient:supervised:and</idno>
<idno type="wicri:Area/Main/Merge">001738</idno>
<idno type="wicri:Area/Main/Curation">001592</idno>
<idno type="wicri:Area/Main/Exploration">001592</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Efficient supervised and semi-supervised approaches for affiliations disambiguation</title>
<author><name sortKey="Cuxac, Pascal" sort="Cuxac, Pascal" uniqKey="Cuxac P" first="Pascal" last="Cuxac">Pascal Cuxac</name>
<affiliation wicri:level="3"><country xml:lang="fr">France</country>
<wicri:regionArea>INIST-CNRS, Vandoeuvre les Nancy</wicri:regionArea>
<placeName><region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
<settlement type="city" wicri:auto="agglo">Nancy</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
</affiliation>
</author>
<author><name sortKey="Lamirel, Jean Charles" sort="Lamirel, Jean Charles" uniqKey="Lamirel J" first="Jean-Charles" last="Lamirel">Jean-Charles Lamirel</name>
<affiliation wicri:level="3"><country xml:lang="fr">France</country>
<wicri:regionArea>LORIA-Synalp, Vandoeuvre les Nancy</wicri:regionArea>
<placeName><region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
<settlement type="city" wicri:auto="agglo">Nancy</settlement>
</placeName>
<placeName><settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="team" n="7">Synalp (Loria)</orgName>
<orgName type="lab">Laboratoire lorrain de recherche en informatique et ses applications</orgName>
<orgName type="university">Université de Lorraine</orgName>
<orgName type="EPST">Centre national de la recherche scientifique</orgName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
<placeName><settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Grand Est</region>
<region type="region" nuts="2">Lorraine (région)</region>
</placeName>
<orgName type="team" n="7">Synalp (Loria)</orgName>
<orgName type="lab">Laboratoire lorrain de recherche en informatique et ses applications</orgName>
<orgName type="university">Université de Lorraine</orgName>
<orgName type="EPST">Centre national de la recherche scientifique</orgName>
</affiliation>
</author>
<author><name sortKey="Bonvallot, Valerie" sort="Bonvallot, Valerie" uniqKey="Bonvallot V" first="Valerie" last="Bonvallot">Valerie Bonvallot</name>
<affiliation wicri:level="3"><country xml:lang="fr">France</country>
<wicri:regionArea>INIST-CNRS, Vandoeuvre les Nancy</wicri:regionArea>
<placeName><region type="region" nuts="2">Grand Est</region>
<region type="old region" nuts="2">Lorraine (région)</region>
<settlement type="city">Vandœuvre-lès-Nancy</settlement>
<settlement type="city" wicri:auto="agglo">Nancy</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">France</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="j">Scientometrics</title>
<title level="j" type="sub">An International Journal for all Quantitative Aspects of the Science of Science, Communication in Science and Science Policy</title>
<title level="j" type="abbrev">Scientometrics</title>
<idno type="ISSN">0138-9130</idno>
<idno type="eISSN">1588-2861</idno>
<imprint><publisher>Springer Netherlands</publisher>
<pubPlace>Dordrecht</pubPlace>
<date type="published" when="2013-10-01">2013-10-01</date>
<biblScope unit="volume">97</biblScope>
<biblScope unit="issue">1</biblScope>
<biblScope unit="page" from="47">47</biblScope>
<biblScope unit="page" to="58">58</biblScope>
</imprint>
<idno type="ISSN">0138-9130</idno>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0138-9130</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Affiliation</term>
<term>Algorithm</term>
<term>Bibliographic database</term>
<term>Bibliographic databases</term>
<term>Citation analysis</term>
<term>Classification</term>
<term>Cluster</term>
<term>Clustering</term>
<term>Data cleaning</term>
<term>Disambiguation</term>
<term>K-means</term>
<term>Naive bayes</term>
<term>Research field</term>
<term>Scientific research</term>
<term>Scientometrics</term>
<term>Semantic analysis</term>
<term>Semi-supervised</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Algorithme</term>
<term>Amas</term>
<term>Analyse citation</term>
<term>Analyse sémantique</term>
<term>Base de données bibliographiques</term>
<term>Classification</term>
<term>Domaine recherche</term>
<term>Recherche scientifique</term>
<term>Scientométrie</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Classification</term>
<term>Recherche scientifique</term>
</keywords>
<keywords scheme="mix" xml:lang="fr"><term>Clustering</term>
<term>affiliations</term>
<term>classification automatique</term>
<term>désambiguisation</term>
<term>infométrie</term>
<term>texte</term>
</keywords>
</textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: The disambiguation of named entities is a challenge in many fields, such as scientometrics, social networks, record linkage, citation analysis, and the semantic web. Name ambiguities can arise from misspellings, typographical or OCR mistakes, abbreviations, and omissions. Searching for the names of persons or organizations is therefore difficult whenever a single name can appear in many different forms. This paper proposes two approaches to disambiguating the affiliations of authors of scientific papers in bibliographic databases. The first assumes that a training dataset is available and uses a Naive Bayes model. The second assumes that no learning resource is available and uses a semi-supervised approach that mixes soft clustering and Bayesian learning. The results are encouraging, and the approach is already partially applied in a scientific survey department. However, our experiments also highlight a limitation: the approach cannot efficiently process highly unbalanced data. Alternative solutions are possible for future developments, particularly the use of a recent clustering algorithm relying on feature maximization.</div>
</front>
</TEI>
<affiliations><list><country><li>France</li>
</country>
<region><li>Grand Est</li>
<li>Lorraine (région)</li>
</region>
<settlement><li>Nancy</li>
<li>Vandœuvre-lès-Nancy</li>
</settlement>
<orgName><li>Centre national de la recherche scientifique</li>
<li>Laboratoire lorrain de recherche en informatique et ses applications</li>
<li>Synalp (Loria)</li>
<li>Université de Lorraine</li>
</orgName>
</list>
<tree><country name="France"><region name="Grand Est"><name sortKey="Cuxac, Pascal" sort="Cuxac, Pascal" uniqKey="Cuxac P" first="Pascal" last="Cuxac">Pascal Cuxac</name>
</region>
<name sortKey="Bonvallot, Valerie" sort="Bonvallot, Valerie" uniqKey="Bonvallot V" first="Valerie" last="Bonvallot">Valerie Bonvallot</name>
<name sortKey="Bonvallot, Valerie" sort="Bonvallot, Valerie" uniqKey="Bonvallot V" first="Valerie" last="Bonvallot">Valerie Bonvallot</name>
<name sortKey="Cuxac, Pascal" sort="Cuxac, Pascal" uniqKey="Cuxac P" first="Pascal" last="Cuxac">Pascal Cuxac</name>
<name sortKey="Lamirel, Jean Charles" sort="Lamirel, Jean Charles" uniqKey="Lamirel J" first="Jean-Charles" last="Lamirel">Jean-Charles Lamirel</name>
<name sortKey="Lamirel, Jean Charles" sort="Lamirel, Jean Charles" uniqKey="Lamirel J" first="Jean-Charles" last="Lamirel">Jean-Charles Lamirel</name>
</country>
</tree>
</affiliations>
</record>
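As a rough illustration of how the summary part of this record could be read programmatically, the sketch below parses only the `<affiliations><list>` fragment with Python's standard `xml.etree.ElementTree`. The function name `affiliation_index` is hypothetical. The fragment is copied from the record above, and the full record is deliberately not parsed here: it uses `wicri:`-prefixed attributes whose namespace declaration does not appear in this excerpt, so a strict XML parser would likely reject it as shown.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the <affiliations><list> part of the record above.
FRAGMENT = """<affiliations><list><country><li>France</li>
</country>
<region><li>Grand Est</li>
<li>Lorraine (région)</li>
</region>
<settlement><li>Nancy</li>
<li>Vandœuvre-lès-Nancy</li>
</settlement>
<orgName><li>Centre national de la recherche scientifique</li>
<li>Laboratoire lorrain de recherche en informatique et ses applications</li>
<li>Synalp (Loria)</li>
<li>Université de Lorraine</li>
</orgName>
</list>
</affiliations>"""


def affiliation_index(xml_text):
    """Map each child of <list> (country, region, ...) to its <li> values."""
    root = ET.fromstring(xml_text)
    index = {}
    for group in root.find("list"):
        index[group.tag] = [li.text for li in group.findall("li")]
    return index


index = affiliation_index(FRAGMENT)
for kind, values in index.items():
    print(kind, values)
```

The result is a plain dictionary keyed by element name, which is enough to list the countries, regions, settlements, and organizations attached to the record.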
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001592 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001592 | SxmlIndent | more
To add a link to this page within the Wicri network
{{Explor lien |wiki= Wicri/Lorraine |area= InforLorV4 |flux= Main |étape= Exploration |type= RBID |clé= ISTEX:5FD71571C0E911CDA4E18F3CBDA209703F76C106 |texte= Efficient supervised and semi-supervised approaches for affiliations disambiguation }}
This area was generated with Dilib version V0.6.33.